Machine Learning Study of Metabolic Networks vs ChEMBL Data of Antibacterial Compounds

Antibacterial drugs (AD) change the metabolic status of bacteria, contributing to bacterial death. However, antibiotic resistance and the emergence of multidrug-resistant bacteria increase interest in understanding metabolic network (MN) mutations and the interaction of AD vs MN. In this study, we employed the IFPTML = Information Fusion (IF) + Perturbation Theory (PT) + Machine Learning (ML) algorithm on a huge dataset from the ChEMBL database, which contains >155,000 AD assays vs >40 MNs of multiple bacteria species. We built a linear discriminant analysis (LDA) and 17 ML models centered on the linear index and based on atoms to predict antibacterial compounds. The IFPTML-LDA model presented the following results for the training subset: specificity (Sp) = 76% out of 70,000 cases, sensitivity (Sn) = 70%, and Accuracy (Acc) = 73%. The same model also presented the following results for the validation subsets: Sp = 76%, Sn = 70%, and Acc = 73.1%. Among the IFPTML nonlinear models, the k nearest neighbors (KNN) showed the best results with Sn = 99.2%, Sp = 95.5%, Acc = 97.4%, and Area Under Receiver Operating Characteristic (AUROC) = 0.998 in training sets. In the validation series, the Random Forest had the best results: Sn = 93.96% and Sp = 87.02% (AUROC = 0.945). The IFPTML linear and nonlinear models regarding the ADs vs MNs have good statistical parameters, and they could contribute toward finding new metabolic mutations in antibiotic resistance and reducing time/costs in antibacterial drug research.


INTRODUCTION
Antibiotics have established themselves as the bedrock of modern medicine. However, the World Health Organization (WHO) in January 2017 produced a list of worldwide priorities for antibiotic-resistant microorganisms. 1 The continued efficacy of antibiotics is jeopardized by the global spread of antibiotic resistance determinants, a process facilitated to a great extent by inappropriate use of antibiotics in clinical, community, and agricultural contexts. 2 To design successful next-generation antibacterial medicines, we must first have a deeper understanding of how bacteria respond to antibiotics. 3 Molecular screenings have identified compounds that limit bacterial growth in vitro. Despite the abundance of bioactive chemicals, only a few biological functions are targeted. 4 Antibiotics that disrupt these energy-consuming pathways disrupt the metabolic balance.
Levy et al. 5 proposed in 2004 that antibiotics have a finite duration of clinical value before being compensated for the inevitable emergence of resistance. Thus, new antibiotics are critical in fighting bacterial resistance. 6 The majority of newly licensed antibiotics are chemically modified variants of established medication classes; several are found naturally. 7,8 As a result, bacterial strains may rapidly evolve resistance mechanisms to these analogues if their existing resistance mechanisms do not already display partial cross-effectiveness. 9 Furthermore, this bacterial resistance to conventional antibiotics has also been attributed to the excessive use of broad-spectrum antibiotics, 10 which requires scientists to find fast, accessible, and cheap methods for discovering new drugs and target molecules against infectious microorganisms. In this regard, an understanding of pathogen metabolism is critical. Metabolic networks (MN) are made up of metabolic pathways, which are a series of biochemical reactions in which the result (output) of one reaction acts as a substrate (input) for another reaction. 11 Novel applications of MN reconstructions of human pathogens have recently been described. These studies have focused on elucidating resistance metabolic dependencies and identifying potential drug targets and antibiotics. 12−14 The influence of the changes in MNs on the capacity of various microorganisms to survive has been demonstrated by Barabaśi′s team and other authors. 15,16 On the other hand, the importance of metabolic mutations in antibiotic resistance is frequently underestimated. 17 Recently, Lopatkin et al. 18 demonstrated that metabolic mutations arise in clinically relevant bacteria in response to antibiotic therapy. They used a variety of in vitro evolution procedures and comprehensive sequencing data analysis. The use of Escherichia coli as a model pathogen demonstrated that metabolic alterations can arise in response to antibiotic treatment. 
18 This research has provided a new perspective on the development of antibiotic resistance by shedding light on the complexities of metabolic alterations. 3 Their findings may assist in explaining the prevalence of multidrug-resistant bacterial strains isolated in areas with little or no antibiotic exposure, as well as the documented increase in antibiotic resistance following extensive herbicide or other environmentally hazardous substance application. 18 Antibacterial drugs (AD) change the metabolic status of bacteria, resulting in bacterial mortality, for example, through oxidative damage or stasis through translation inhibition, resulting in decreased cellular respiration. 3 The bacterium's metabolic state has an effect on antibiotic sensitivity; thus, altering the metabolic state can increase antibiotic efficacy. 3,17 In this sense, the interaction of ADs and MNs can contribute toward finding new metabolic mutations in antibiotic resistance, mainly regarding (multi)drug-resistant bacteria.
On the other hand, prediction using computer models has been widely employed as a significant alternative to obtain experimental data and save resources and research time in drug discovery and development. 19,20 These methods allow scientists to establish relationships between many datasets and structural molecular information that contributes to biological activity to solve complex problems. 21 Additionally, machine learning (ML) enables us to process data in terms of molecular descriptors. Traditional methods for getting metadata from complex databases of preclinical assays are not good enough. One example of a traditional method is the ChEMBL database, which collects big datasets from a variety of heterogeneous and independent sources and aims to investigate complicated and dynamic interactions between data from preclinical trials. 22 Numerous cheminformatics and other computational techniques have been developed to assist in the discovery of ADs against various bacteria. However, the techniques are limited to predicting the drugs' biological activity in a certain strain under specified conditions. 23 Multitask quantitative structure−biological effect relationship (mtk-QSBER) models have attempted to address these drawbacks. 24 They allow the integration of multiple chemical and biological data types, enabling the simultaneous prediction of pharmacological activities, toxicities, and/or other safety concerns. 25 Different approaches have been presented in the antibacterial field to estimate biological activities and the ADMET characteristics (absorption, distribution, metabolism, elimination, and toxicity) of diverse chemical compounds at the same time. For example, anti-Pseudomonas activity, 26 antituberculosis effects, 27 activity against bacteria present in noma 28 or against Gram-negative bacteria, 29 or to predict effective anti-staphylococcal agents. 30 Gonzaĺez-Díaz et al. 
developed IFPTML [Information Fusion (IF) + Perturbation Theory (PT) + Machine Learning (ML)], 31 a technique for ML with multiple outputs and input-coded labels to address this type of challenge. The IFPTML model produces the scoring function f(s ij ) calc . IFPTML has been applied to complex data analysis in molecular sciences, 31,32 infectious disease, 33 nanotechnology, 34,35 etc. These problems have involved drugs, drug cocktails, proteins, genes, vaccines, MNs, and complex networks. 16,32,36−38 The present study proposes a solution for this type of data by combining the basics of information fusion (IF), perturbation theory (PT), and machine learning (ML) approaches to create an IFPTML model. 35,39−42 This paradigm is particularly well suited for large databases with many features and combinatorial information. This paper analyzed a large dataset (>155,000 preclinical assays) against different bacteria downloaded from the ChEMBL database. We merged this dataset with structural data for over 40 MNs from a variety of microorganisms previously reported by Barabási's laboratory team. 15 We employed moving average (MA) operators to describe the assay perturbations and PT multiplier operators (PTOs) to achieve data combination and dimension reduction. Finally, we used linear discriminant analysis (LDA) and nonlinear ML algorithms to find the best IFPTML predictive model. Figure 1 illustrates the overall approach for developing the IFPTML model for ADs vs MNs.

MATERIALS AND METHODS
2.1. ChEMBL Dataset of Antibacterial Compounds. We downloaded a large dataset of preclinical assays of ADs from the ChEMBL database. The working dataset was created through a data fusion process between the ChEMBL dataset and the MNs of Barabási's group released by Jeong et al. 15 Accordingly, we searched the ChEMBL database only for biological activity assays of ADs against organisms present in the MN dataset. The steps carried out were as follows.
In the ChEMBL dataset, the organisms were searched for using targets and assays, and the results were saved in an Excel file. See details about this search in Supporting Information S00 (xlsx). Subsequently, we merged the datasets obtained with each keyword into a single file. We then curated the data, eliminating all duplicate cases and all cases reporting no biological activity value. The data for the organisms Methanococcus jannaschii and Treponema pallidum were excluded since the two compounds reported for them in ChEMBL have no measured biological activity. After data curation, the ChEMBL AD activity dataset comprises values for >300 parameters (MIC, IC50, etc.) from >155,000 biological tests involving >50,000 compounds vs >25 bacteria species. Table S01 (Supporting Information S01) shows the statistics for multiple types of biological activity parameters in the ChEMBL dataset.

IFPTML Analysis
Steps. The IFPTML analysis process is divided into three phases (IF + PT + ML). The IFPTML technique workflow for AD vs MN analysis is depicted in Figure 2, along with the general processes discussed in this research. The initial step in the IF phase is to obtain values v i and v j for the different biological properties c d0 and c s0 of the two subsystems (AD and MN). Following that, we preprocessed all observed values using a variety of units, scales, and degrees of uncertainty to create dimensionless functions that characterize the system as a whole, as well as the AD vs MN situations. Barabaśi's group released the MN dataset as gzipped ASCII files. 15 The numbers of nodes (metabolites), input−output links (metabolic reactions), node degrees, topological indices, names, and codes of >40 bacteria species analyzed here appear in Tables S02 and S03 (Supporting Information S01). In the IF approach, the chemical compounds' structures of ADs (f k (D i ) values) were fused with structural information included in the MN datasets of the various species. All instances were assigned to one of two series: training (subset = t) or validation (subset = v). Sampling should be random, representative, and stratified to the greatest extent practicable. 43 To build triads, we randomly picked original data from the two datasets. Following that, cases were randomly assigned to set = t and set = v in proportions of 75 vs 25%. 43 The total of 154,214 compounds were divided into 115,662 for the training set and 38,552 for the validation set.
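The random, stratified 75/25 assignment to the training and validation series described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `stratified_split`, the fixed seed, and the toy labels are hypothetical, and the study itself does not report which tool performed the split.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Split case indices so each class keeps the same train/validation ratio."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, valid_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)  # random assignment within each class
        cut = round(len(members) * train_fraction)
        train_idx.extend(members[:cut])
        valid_idx.extend(members[cut:])
    return sorted(train_idx), sorted(valid_idx)


# Toy labels (hypothetical): 40 active and 60 inactive cases.
labels = [1] * 40 + [0] * 60
train_idx, valid_idx = stratified_split(labels)
print(len(train_idx), len(valid_idx))
```

Splitting within each class, rather than over the pooled cases, keeps the active/inactive ratio the same in both series, which is what "stratified" refers to in the text.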
The output f(v ij ) calc was determined as a linear combination of scores for several c i , a generic term that covers a variety of multioutput assay conditions, such as targets, assays, organisms, and MNs. Here, c 0 is the biological activity v ij , e.g., minimal inhibitory concentration (MIC (μg·mL −1 )) or minimal bactericidal concentration (MBC (μg·mL −1 )); c 1 is the specific protein (ChEMBL database); c 2 is the assay organism in the experiment; c 3 is the MN microorganism species; c 4 is the target type; and c 6 comprises mappings to the ChEMBL targets. Table S04 in the Supporting Information S01 contains more information. The parameters f k , Δf k i (c q ), and ΔΔf k i (c q ) are the independent input variables, while f(v ij ) = 1 is the dependent (output) variable. The molecular descriptors, D ik , of linear indexes based on atoms include f q (N, M, w) g for each chemical q. Eq 1 gives the general definition of the atom-based linear indexes, (1) where N 1 is the selected matrix norm (Manhattan distance), M denotes the graph-theoretic electronic density matrix, and w denotes the physicochemical weight. In this scenario, the Ghose-Crippen Log P, the electronegativity, and the van der Waals volume were used. Finally, the indices were computed for the following atom groups: H-bond acceptors (A), C atoms in the aliphatic chain (C), H-bond donors (D), C atoms in the aromatic part (P), and heteroatoms (X). 44

Molecular Pharmaceutics
pubs.acs.org/molecularpharmaceutics Article

Next, we defined and determined the values of all vectors corresponding to the structural descriptors D dk and D sk for the two subsystems. Additionally, we defined and calculated the vector elements c dj and c sj with all AD and MN bacteria labels/assay conditions. Following that, we transformed the estimated molecular descriptors D dk and D sk to Box−Jenkins MA operators. The PTOs estimated in this work include the chemical structure and/or physicochemical properties of the AD subsystem Δf(D dk ), as well as structural information about the bacteria's MN ΔΔf(D sk ). They were written as deviation terms of each subsystem, f(D dk ) and f(D sk ), with respect to the average value for the respective reference subsystems: ⟨f(D dk ) cdj ⟩ and ⟨f(D sk ) csj ⟩. Thus, the first terms f(D dk ) and f(D sk ) in these formulas describe the subsystem, while the averages describe the assay conditions. The following equations were utilized (eqs 2 and 3).
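The deviation-from-average idea behind the MA operators can be illustrated with a short sketch. It implements only the simple form described in the text, a descriptor value minus the mean of that descriptor over all cases sharing the same assay condition; it is not a reproduction of eqs 2 and 3, and the function name, descriptor values, and condition labels below are hypothetical.

```python
from collections import defaultdict


def moving_average_deviation(values, conditions):
    """Return each value minus the mean of all values sharing its condition."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value, cond in zip(values, conditions):
        totals[cond] += value
        counts[cond] += 1
    # Average of the descriptor over every case measured under each condition.
    means = {cond: totals[cond] / counts[cond] for cond in totals}
    return [value - means[cond] for value, cond in zip(values, conditions)]


# Hypothetical descriptor values for five cases under two assay conditions.
descriptor = [2.0, 4.0, 6.0, 10.0, 14.0]
assay = ["MIC", "MIC", "MIC", "IC50", "IC50"]
deviations = moving_average_deviation(descriptor, assay)
print(deviations)  # [-2.0, 0.0, 2.0, -2.0, 2.0]
```

A deviation near zero means the compound is structurally typical of the compounds tested under that condition; large positive or negative values mark compounds that depart from the condition's average.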
In the Supporting Information S00, we detailed all fused datasets of drugs, and the PTO's values of the IF technique (training, validation, and screening sets).
The model was validated using the number of training examples (N), and overall model quality was assessed with parameters such as sensitivity (Sn), specificity (Sp), Chi-square (χ 2 ), and the p-level. LDA algorithms were run using the STATISTICA 6.0 program. 47 Figure 2 shows the IFPTML processing information in a detailed workflow.
2.2.2. IFPTML Nonlinear Models. Next, we ran numerous nonlinear ML techniques built with the Waikato Environment for Knowledge Analysis (WEKA) software program, version 3.8.5. 48 We employed a total of 17 ML methods to construct the different nonlinear IFPTML classification models using the current dataset. These included Bayesian networks, decision trees, ensemble approaches, rule-based classifiers, neural networks, and functions. Each method employs a learning algorithm to determine the model that most closely matches the relationship between the input dataset and the class. The Bayesian Network K2/B (BN) and Naive Bayes network (NBN) classifiers are based on Bayes' theorem. The classification trees applied were Random Forest (RF) 49 and the pruned or unpruned C4.5 decision tree classifier (J48). 50 RF is an extension of Bagging, with the addition of randomized feature selection. It first selects a subset of features at random and then performs the traditional split selection within the selected feature subset. 51 Different ensemble methods were used. These are meta-algorithms that combine the strengths of weak learners, such as bagging, boosting, voting, and stacking. In the first case, bagging methods lower the variance of a base estimator (e.g., a decision tree) by constructing an ensemble from it. They are a quick and easy way to improve a single model without changing the underlying base algorithm. 51 An implementation of CART (SimpleCart) was applied based on classification trees in the Weka package. 
52 The second group is the boosting algorithms that are capable of transforming weak learners into strong ones. Intuitively, a weak learner does little better than a random guess, whereas a strong learner performs almost perfectly. 51 In this work, we employed three exemplary algorithms from this family of algorithms: Adaboost, LogitBoost, and MultiBoosting. 53 These models were built in conjunction with classifier trees based on entropy (DecisionStump). Voting is a straightforward ensemble procedure that generates two or more submodels. Each submodel generates predictions that are combined in some manner, such as by computing the mean or mode of the forecasts, allowing each submodel to vote on the appropriate conclusion. 54 Finally, stacking is a universal technique that may be thought of as a straightforward expansion of voting ensembles, where an individual learner is combined. Individuals are considered first-level learners, while combiners are called second-level or meta-learners. 51 In this work, the meta classifier ZeroR was used as the base model.
Artificial neural network (ANN) classification is a nonlinear classification technique inspired by biological neural networks. Feature vectors are used to describe objects (compounds). Each characteristic is associated with a weight and is transmitted to an input neuron. Input is routed to the output layer via hidden layers based on these weights. 55 The output layer mixes these signals (e.g., activity or class prediction). Weights are initially set at random. The weights are changed as the network is fed data so that the overall output approximates the observed endpoint values for the chemicals. 56 In our work, the "hidden" layer was developed from 2 to 13 to predict the antibacterial compounds.
Other functions such as k nearest neighbors (KNN), binary logistic regression (BLR), and various support vector machines (SVMs) were implemented. KNN is a lazy-learning method that assigns a new compound to the most prevalent class among the known compounds in its neighborhood; 57,58 several parameter combinations were explored. The number of nearest neighbors (k) varied between 1 and 20. In addition, we employed the four distances (Chebyshev, Edit, Euclidean, and Manhattan) of the LinearNNSearch in a feature space. BLR is an algorithm that can be used for predicting a categorical variable (e.g., Yes/No or Pass/Fail). 58 Finally, SVM is a method that works well with noisy data. 59 Models are obtained by identifying a decision hyperplane that yields the largest possible margin between activity classes. Kernels can be used to map nonlinear data to higher dimensions.
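The KNN procedure described above (majority vote among the k closest known compounds under a chosen distance) can be sketched in a few lines. This is a minimal illustration, not the WEKA implementation used in the study; the function names and the two-descriptor toy data are hypothetical, and the Edit distance is omitted because it applies to strings rather than numeric descriptor vectors.

```python
from collections import Counter


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_x, train_y, query, k=3, distance=euclidean):
    """Assign the query to the most common class among its k nearest neighbors."""
    ranked = sorted(range(len(train_x)),
                    key=lambda i: distance(train_x[i], query))
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-descriptor training compounds: class 0 near the origin,
# class 1 near (5, 5).
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]

for dist in (manhattan, euclidean, chebyshev):
    print(dist.__name__,
          knn_predict(train_x, train_y, (0.5, 0.5), k=3, distance=dist),
          knn_predict(train_x, train_y, (5.5, 5.5), k=3, distance=dist))
```

Because KNN stores the training compounds and defers all computation to prediction time, the choice of k and of the distance function are its only tunable parameters, which is why the study scans k from 1 to 20 across several distances.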
In the case of the rule-based classifiers, three methods were applied. PART builds a decision list by constructing a partial C4.5 decision tree in each iteration and converting the best leaf into a rule. 60 The Ripple-Down Rule (Ridor) learner constructs a default rule and then the exceptions to the default with the lowest (weighted) error rate; the exceptions are a set of rules that predict classes other than the default. 61 Finally, Hühn and Hüllermeier proposed the Fuzzy Unordered Rules Induction Algorithm (FURIA), a fuzzy rule-based classification approach. 62 The performance metrics used were Area Under Receiver Operating Characteristic (AUROC), Accuracy (Acc), Sn, Sp, Precision, and F1 score.
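For reference, the performance metrics listed above follow directly from the confusion matrix, and AUROC can be computed as the probability that a randomly chosen active case receives a higher score than a randomly chosen inactive one. The sketch below uses the standard textbook definitions; the function names and the toy labels, predictions, and scores are hypothetical and are not taken from the study.

```python
def classification_metrics(y_true, y_pred):
    """Sn, Sp, Acc, Precision, and F1 from binary labels (1 = active)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # The toy data below has no empty class, so no zero-division guard is added.
    return {
        "Sn": tp / (tp + fn),
        "Sp": tn / (tn + fp),
        "Acc": (tp + tn) / len(pairs),
        "Precision": tp / (tp + fp),
        "F1": 2 * tp / (2 * tp + fp + fn),
    }


def auroc(y_true, scores):
    """Fraction of (active, inactive) pairs ranked correctly; ties count 1/2."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example: 4 active and 6 inactive cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.1, 0.2, 0.4, 0.35, 0.6, 0.75]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
print(auroc(y_true, scores))
```

A classifier that gives every case the same score, as a majority-class predictor does, yields AUROC = 0.5 under this definition, the value that corresponds to random guessing.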

Domain of Applicability (DoA).
Producing reliable predictions requires an understanding of the model's constraints and applicability. The DoA can be established using either the leverage approach or similarity metrics based on Euclidean distances between all training and test compounds. 63,64 We employed the leverage approach. After computing the hat matrix for the structural domain, the residuals of the response variable were plotted against the leverages (the diagonal values of the hat matrix (h)) to define the DoA visually (Williams plot). 65 Chemicals that exceeded the specified threshold values were identified as outliers in terms of response and leverage. Three residuals were used as the response threshold. The critical hat value was set as h* = 3(p + 1)/n, where p denotes the number of model descriptors and n is the number of training compounds. 65 Gramatica 66 classified chemicals with h > h* as structurally influential. In addition to the test series, the DoA was evaluated for an external series composed of 224,719 compounds (without antibacterial activity).
In Table 1, we show selected variables of the IFPTML-LDA model for the different conditions used in the model. The criteria chosen are those that are expected to be more significant in terms of biological activity (AD vs MN).
The probabilities used a priori to fit the model were set π 0 (f(v ij = 0)) = π 1 (f(v ij = 1)) = 0.5. The molecular descriptors were transformed to Box−Jenkins moving averages. Two duplex linear indices atom-based level descriptors were used (with C atoms in an aliphatic chain and total (global)indices). In the first, nonstochastic matrix orders 2 and 3 were included in the model. In the second, the nonstochastic matrix order varied from 0 to 3 (more information is available in Table S5 in Supplementary Information S01). The output of the model v ij is a score function for the biological activity of the ith AD under various combinations of the assay conditions c sj and c dj . In this work, one chemical is classified as active based on its desirability d(c 0 ) of the biological characteristic v ij (c 0 ) and a preset cutoff value. The minimum inhibitory concentration (MIC) for biological activity v ij (c 0 ) was set to be less than 4213 g·mL −1 or less than the average for unmeasured characteristics.
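The labeling rule that turns a measured value into the observed class f(v ij ) obs can be written compactly. As stated in the text, a case is active when the value exceeds the cutoff and the property is to be maximized (d(c 0 ) = 1), or when the value is below the cutoff and the property is to be minimized (d(c 0 ) = −1); every other case, including d(c 0 ) = 0, is labeled 0. The function name and the numeric values below are hypothetical illustrations, not cutoffs from the study.

```python
def observed_class(value, cutoff, desirability):
    """Return 1 (active) or 0 following the cutoff/desirability rule.

    desirability = 1  -> property should be maximized (e.g., inhibition %)
    desirability = -1 -> property should be minimized (e.g., MIC, IC50)
    desirability = 0  -> direction ambiguous; the case is never labeled active
    """
    if desirability == 1 and value > cutoff:
        return 1
    if desirability == -1 and value < cutoff:
        return 1
    return 0


# Hypothetical cases: (value, cutoff, desirability)
print(observed_class(80.0, 50.0, 1))     # high inhibition (%), maximize
print(observed_class(20.0, 100.0, -1))   # low IC50 (nM), minimize
print(observed_class(500.0, 100.0, -1))  # high IC50 (nM), minimize
print(observed_class(80.0, 50.0, 0))     # ambiguous property
```

Encoding the direction of improvement in d(c 0 ) lets a single binary output cover properties with opposite senses, which is what allows >300 activity parameters to be fused into one classification problem.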
When v ij > cutoff and the a priori desirability function d(c 0 ) = 1, the AD was regarded as active (f(v ij ) obs = 1). Additionally, if v ij < cutoff and d(c 0 ) = −1, then f(v ij ) obs = 1; otherwise, f(v ij ) obs = 0. When we aim to maximize the value of biological activity v ij (c 0 ), such as inhibition (%), the desirability is d(c 0 ) = 1. On the other hand, d(c 0 ) = −1 when the value of biological activity v ij (c 0 ) is desired to be minimized, for example, potency (nM), IC 50 (nM), K i (nM), or EC 50 (nM). When the need to maximize or minimize v ij (c 0 ) is ambiguous, the desirability was taken to be d(c 0 ) = 0. In any instance, the values of d(c 0 ) for the same property can be changed (swapped) to suit a particular circumstance. 67
Equation 5 contains a full explanation of the input variables analyzed, and the best model discovered has the following equation (eq 5). The model's statistical parameters are as follows: N is the number of training examples, χ 2 is the Chi-square statistic, and p is the p-level. As shown in eq 5, the parameters Δf 1 , Δf 2 , Δf 3 , Δf 6 , and ΔΔf 2 all have a negative effect on the numerical score of the biological activity; these parameters correspond to the boundary conditions for the measure, target, and data curation. On the other hand, the variables f(v ij ) ref , Δf 4 , Δf 5 , ΔΔf 1 , ΔΔf 3 , and ΔΔf 4 (protein, MN organism, and target type) all influence the activity positively. Additionally, this equation identifies the parameters that contribute most to the activity. In the case of ΔΔf 3 , the coefficient is 2.513, which is a realistic value considering that the most significant variations in activity, even among identical compounds, are explained by the diverse techniques employed to assess the activity. The same holds true for the ΔΔf 2 parameter, which has a coefficient of 3.523 in the equation and contributes strongly and negatively to activity.
Molecular descriptors enable the indirect correlation of desired attributes with a molecule's structure. 68 Analysis of the structural interpretation of the IFPTML-LDA model showed that total and local linear indices (atom and atom type) are the most influential descriptors in the chemical datasets, specifically nonstochastic matrix orders 2 and 3 for C atoms in the aliphatic chain. These can be aliphatic hydrocarbons with only single covalent bonds (alkanes), hydrocarbons that contain at least one C−C double bond (alkenes), and hydrocarbons that contain a C−C triple bond (alkynes). The total (global) indices are included at nonstochastic matrix orders 0, 1, 2, and 3. The structural significance of these descriptors can be illustrated in several ways: as (a) chain length effect, (b) branching effect, (c) multiple bond effect, and (d) heteroatom modification effect. 69 The influence of these structural characteristics on these molecular descriptors (which have mathematical linear map matrices) is referred to as graph-theoretic electronic structure models. 69 Specifically, zero-order total (and local) linear indices can be classified according to their "dimensionality" as one-dimensional (1D) descriptors. These include "bulk" properties and physicochemical properties (hydrophobicity, molecular polar surface area, molar refractivity, molecular polarizability, and sum atomic charge). 
70 In general, these linear indices (total and local) include information about various structural changes in organic molecules, including chain-lengthening, branching, heteroatom content, and multiple bonds. However, most of the variables selected by the model only account for variability on linear indices (global indices/nonstochastic matrix).
The classification matrices for the training and validation series are shown in Table 2.
3.2. IFPTML Nonlinear Models. Additionally, we trained a different form of IFPTML model using a distinct class of machine learning methods. We utilized a total of 17 machine learning classifiers. The performance of these models is summarized in Table 3, and the findings are displayed graphically in Figures 3 and 4.
As expected, 10 of the 17 ML models displayed better Sn and Sp values than the IFPTML-LDA model: KNN, RF, Bagging, BN, J48-DT, PART, MLP, FURIA, Ridor, and LogitBoost. However, AdaBoost, BLR, SVM, MultiBoostAB, NBN, Stacking (ZeroR), and Vote showed a lower Sp than the IFPTML-LDA model. For Stacking (ZeroR) and Vote, Sn = 0% and AUROC = 0.5 indicate that classification is no better than random guessing. Thus, these techniques are not suitable for AD vs MN data processing. In terms of accuracy, the first 10 algorithms mentioned also performed well, with a global Acc = 80−97.4%, suggesting that this dataset is dominated by nonlinear relationships.
In the validation set, the same algorithms (KNN, RF, Bagging, BN, J48-DT, PART, MLP, FURIA, Ridor, and LogitBoost) also outperform the IFPTML-LDA model. In addition, these techniques display satisfactory goodness of fit and goodness of prediction, performing consistently well on both training and validation sets (see Table 3). The classification rates for the active and inactive classes (Sn and Sp) are very high, indicating a significant discriminant capacity for future virtual screening applications.
In the training/validation sets, KNN, Bagging, BN, J48, PART, and RF show AUROC > 0.95. The ROC curve is formed by plotting the true-positive rate versus the false-positive rate at various thresholds. Values close to 1 indicate that classification is almost perfect across all thresholds; thus, these six techniques are considered good classifiers for this dataset. They are the most accurate models as determined by a consensus examination of their overall Acc and AUROC parameters. Nevertheless, the gain in performance from LDA ...
Figure 5 shows the DoA as a double-ordinate plot of the test-set residuals (first ordinate) and the external-validation residuals (second ordinate) vs leverages (abscissa) (Williams plot). Within the domain, the examples fall within a rectangular area defined by a band of two residuals and a leverage threshold of h = 0.00033. 71 As can be observed, the majority of validation examples fall inside this range. There are, however, a significant number of examples with leverage greater than the threshold but with standardized residuals within the limits. In these instances, where the leverage value is greater than h*, the prediction should be regarded as untrustworthy. Values greater than the warning leverage (h*) indicate that the compound's predicted response is an extrapolation of the model, and hence the predicted value should be used with extreme caution. There are no instances in either the training or prediction series where the residual values exceed the range defined for residuals and LOO residuals. Consequently, no response outliers are reported, and the model can be expected to predict new chemicals reliably within this DoA.
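The leverage threshold quoted above can be checked against the formula h* = 3(p + 1)/n given in the Methods. The sketch below assumes p = 12 (the number of model parameters cited in this paper's comparison with other models) and n = 115,662 (the training-set size); whether these are exactly the values the authors used is an assumption. The helper `inside_domain` is hypothetical and simply encodes the rectangular Williams-plot region, with the residual limit left as a parameter because the text mentions both two and three residuals.

```python
def critical_leverage(n_descriptors, n_training):
    """Warning leverage h* = 3(p + 1)/n."""
    return 3 * (n_descriptors + 1) / n_training


def inside_domain(leverage, std_residual, h_star, residual_limit=3.0):
    """True when a case lies inside the Williams-plot rectangle."""
    return leverage <= h_star and abs(std_residual) <= residual_limit


# Assumed inputs: p = 12 model parameters, n = 115,662 training cases.
h_star = critical_leverage(n_descriptors=12, n_training=115_662)
print(f"h* = {h_star:.6f}")  # prints h* = 0.000337

# Hypothetical cases: (leverage, standardized residual)
print(inside_domain(0.0001, 1.5, h_star))  # low leverage, small residual
print(inside_domain(0.0010, 0.5, h_star))  # high leverage: extrapolation
print(inside_domain(0.0001, 3.5, h_star))  # large residual: response outlier
```

Under these assumptions the computed value agrees with the reported threshold of about 0.00033, and the second hypothetical case shows the situation the text warns about: a small residual does not make a prediction trustworthy when the leverage exceeds h*.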
3.3. Comparison with Other Heterogeneous Series of Compounds Approaches. The linear and nonlinear IFPTML models of the ADs vs MNs were compared with other models for discovering antibacterial compounds that were previously described in the literature and are based on heterogeneous series of compounds. Table 4 shows a comparison between the present model and some of these models (heterogeneous series of compounds, drug family >10). An analysis of Table 4 reveals that the current work has the largest dataset (a very complex dataset with a notably larger number of compounds). Only six previous models contain more than 10,000 molecules. Compared to previous models with a parameter count of 6−8, the model provided in this study has a considerable number of parameters (12). However, models 3, 5, and 6 show a greater number of variables: 62 72 and 21, 73 respectively.
LDA predominates among the techniques used to build the models (6 of the 13). Two models include KNN (models 3 and 5) 72,73 and two include ANN (models 4 and 10). 23,26 SVM is analyzed in one model (model 6), 73 while iterative stochastic elimination (ISE) 74 and the self-organizing map (SOM) (Kohonen) 75 are used in models 11 and 12, respectively. Regarding accuracy, it is worth noting that all models compared had values of more than 75%. However, the accuracy values of the RF and KNN techniques in this study (97.4%) are higher than those of other studies carried out with similar datasets, such as Nocedo et al. 76 (88.6%). An external prediction series was the most frequently used validation technique, appearing in 12 of 13 models, including this one. This indicates that our validation approach is well established. As illustrated in Table 4 (models 1−3, 7, and 8), these models are not able to predict multiple species; rather, they only predict one type of microorganism. Recently, multispecies models have been developed; some of them predict biological activity exclusively for members of the same genus or subgroup of bacteria (models 4 to 13), except for models 11, 74 13, 76 and 14−16, which include multiple bacterial strains. Among them, the IFPTML models of our study (models 14−16) encompass the highest number of compounds and the best accuracy (only model 7 77 is superior, but that work included only 7,517 compounds). In addition, our models include the prediction of antibacterial activity against various bacteria, including their MNs, which was only analyzed in model 13. 76

CONCLUSIONS
The use of broad-spectrum antibiotics has been linked to bacterial resistance to conventional antibiotics. Understanding  pathogen metabolism is critical for developing new medications and targets for antibacterial treatment. The impact of alterations in metabolic networks on the ability of various bacteria to survive has been demonstrated. In this research work, we developed an IFPTML-LDA model for predicting the antibacterial activity, which took into account the structure of MNs. The antibacterial activity and appropriateness of >155,000 biological experiments of >50,000 chemicals vs >25 different types of bacteria species were predicted using IFPTML-LDA models. Compared to other ML linear and nonlinear models (e.g., SOM models) presented in this work and in the literature, the model demonstrated strong predictive power (Sn, Sp, and Acc = 74%). Among the 17 ML algorithms employed in the development of nonlinear IFPTML classification models, the KNN, Bagging, BN, J48, PART, and RF models showed the highest AUROC, Accuracy, F1 score, Sn, and Sp values (>85% in training/validation sets). We can conclude that the IFPTML model stated could be a simple, valuable, and flexible tool, reducing time and costs in antibacterial drug investigation.
Statistics for multiple types of biological activity parameters in ChEMBL dataset (Table S01); details of the metabolic networks of >40 organisms (Table S02); average values of f k for the metabolic networks of >40 organisms (Table S03); conditions included in ChEMBL dataset of antibacterial drugs vs MRN analysis (Table S04); and linear index based on atoms descriptors included in the model (Table S05) (PDF) Search of organisms in the ChEMBL dataset using targets and assays (XLSX)